On-line learning of non-monotonic rules by simple perceptron



Jun-ichi Inoue†§, Hidetoshi Nishimori† and Yoshiyuki Kabashima‡

† Department of Physics, Tokyo Institute of Technology, Oh-okayama, Meguro-ku,
Tokyo 152, Japan

‡ Department of Computational Intelligence and Systems Science, Interdisciplinary
Graduate School of Science and Engineering, Tokyo Institute of Technology, Yokohama
226, Japan

Abstract. We study the generalization ability of a simple perceptron which learns
unlearnable rules. The rules are presented by a teacher perceptron with a
non-monotonic transfer function. The student is trained in the on-line mode. The
asymptotic behaviour of the generalization error is estimated under various conditions. 
Several learning strategies are proposed and improved to obtain the theoretical lower 
bound of the generalization error. 



PACS numbers: 87.00, 02.50, 05.90 



Short title: On-line learning of non-monotonic rules 
February 1, 2008 



§ E-mail address: jinoue@stat.phys.titech.ac.jp 




1. Introduction 



One of the important features of feed-forward neural networks is their ability to learn
a rule from examples JL], || ||. The student network can adopt its synaptic weights 
following a set of examples given by the teacher network so that it can make
predictions on the output for an input which has not been shown before. Learning of
unlearnable rules by a perceptron is a particularly interesting issue because the student 
usually does not know the structure of the teacher in the real world. For machine 
learning, it is important to improve the learning scheme and minimize the prediction 
error even if it is impossible to exactly reproduce the input-output relation of the teacher. 
Only a few papers have appeared concerning learning of unlearnable rules where the 
teacher and the student have different structures H [5], || . 

In this paper we study the generalization ability of a simple perceptron using the 
on-line algorithm from a teacher perceptron with a non-monotonic transfer function of 
reversed-wedge type, which has been investigated as an associative memory fl7L RL EJ
and a perceptron [10, 11]. If a simple monotonic perceptron learns a rule from
examples presented by a non-monotonic perceptron, the generalization error remains 
non-vanishing even if an infinite number of examples are presented by the teacher. We 
study the limiting value and asymptotic behaviour of the generalization error in such 
unlearnable cases. 

This paper is composed of nine sections. In the next section, the problem 
is formulated and the general properties of generalization error are investigated. 
Perceptron and Hebbian learning algorithms in the on-line scheme are investigated in 
section 3. For each learning scheme, we calculate the asymptotic behaviour of learning 
curve. In section 4 we investigate the effects of output noise on learning processes. In 
section 5 we introduce the optimal learning rate and calculate the optimal generalization 
error. The optimal learning rate obtained in section 5 contains a parameter unknown
to the student, which is somewhat at odds with the idea of learning because the learning
process then depends on an unknown teacher parameter. Therefore, in section 6 we introduce a
learning rate independent of the unknown parameter and optimize the rate to achieve 
a faster convergence of generalization error. In section 7, we allow the student to ask 
queries under the Hebbian learning algorithm. It is shown that learning is accelerated 
considerably if the learning rate is optimized. In section 8 we optimize the learning 
dynamics by a weight-decay term to avoid an over-training problem in Hebbian learning 
observed in section 3. The last section contains summary and discussions. 






2. Generic properties of generalization error 

Our problem is defined as follows. The teacher signal is provided by a single-layer
perceptron with an $N$-dimensional weight vector $\mathbf{J}^0$ and a non-monotonic
(reversed-wedge) transfer function

$$T_a(v) = \mathrm{sign}[v(a-v)(a+v)] \qquad (2.1)$$

where $v = \sqrt{N}\,(\mathbf{J}^0\cdot\mathbf{x})/|\mathbf{J}^0|$, $\mathbf{x}$ is the input vector normalized to unity, $a$ is the width of the
reversed wedge, and sign denotes the sign function. The student is a simple perceptron
with the weight vector $\mathbf{J}$ whose output is

$$S(u) = \mathrm{sign}(u) \qquad (2.2)$$

where $u = \sqrt{N}\,(\mathbf{J}\cdot\mathbf{x})/|\mathbf{J}|$. The inputs $\mathbf{x}$ are drawn independently from a uniform
distribution on the $N$-dimensional unit sphere. The student can learn the rule of the
teacher perfectly if and only if $a = \infty$.
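
As a concrete illustration of definitions (2.1) and (2.2), the following minimal sketch evaluates the local fields and the two outputs for given weight vectors. The function names (`local_field`, `teacher_output`, `student_output`) are illustrative choices and do not appear in the paper.

```python
import math

def local_field(weights, x):
    """Return sqrt(N) * (weights . x) / |weights|, i.e. u or v in the text."""
    n = len(x)
    dot = sum(w * c for w, c in zip(weights, x))
    norm = math.sqrt(sum(w * w for w in weights))
    return math.sqrt(n) * dot / norm

def teacher_output(v, a):
    """Reversed-wedge teacher T_a(v) = sign[v (a - v)(a + v)], equation (2.1)."""
    return 1 if v * (a - v) * (a + v) > 0 else -1

def student_output(u):
    """Simple perceptron S(u) = sign(u), equation (2.2)."""
    return 1 if u > 0 else -1
```

For a width $a = 1$ the teacher agrees with a monotonic perceptron only for $|v| < 1$; passing `math.inf` as `a` recovers the learnable case.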

It is convenient to introduce the following two order parameters. One is the overlap 
between J° and J 

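$$R = \frac{\mathbf{J}^0\cdot\mathbf{J}}{|\mathbf{J}^0|\,|\mathbf{J}|} \qquad (2.3)$$

and the other is the norm of the student weight vector

$$l = \frac{|\mathbf{J}|}{\sqrt{N}}. \qquad (2.4)$$

In the limit $N \to \infty$ the random variables $u$ and $v$ obey the normal distribution

$$P_R(u,v) = \frac{1}{2\pi\sqrt{1-R^2}} \exp\left[-\frac{u^2+v^2-2Ruv}{2(1-R^2)}\right]. \qquad (2.5)$$
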

The generalization error $\epsilon_g$, or the probability that the student produces a wrong answer, can
be obtained by integrating the above distribution over the region satisfying $T_a(v) \neq S(u)$
in the two-dimensional $u$-$v$ space. After simple calculations we find

$$\epsilon_g \equiv E(R) = 2\int_a^{\infty} Dv\, H\!\left(-\frac{Rv}{\sqrt{1-R^2}}\right) + 2\int_0^{a} Dv\, H\!\left(\frac{Rv}{\sqrt{1-R^2}}\right) \qquad (2.6)$$

where $H(x) = \int_x^{\infty} Dv$ and $Dv = dv\, \exp(-v^2/2)/\sqrt{2\pi}$.
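
The integral (2.6) is easy to evaluate numerically. The sketch below is one possible way to do so, assuming a plain midpoint rule and a cutoff `v_max` for the infinite range; both are numerical choices made here, not part of the paper. It can be checked against two limits that follow from (2.6): $E(R) = \arccos(R)/\pi$ for $a = \infty$ and $E(1) = 2H(a)$.

```python
import math

def H(x):
    """Gaussian tail H(x) = integral of Dv from x to infinity."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def gauss(v):
    """Standard normal density, the weight in Dv."""
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)

def generalization_error(R, a, v_max=10.0, steps=4000):
    """E(R) of equation (2.6), midpoint rule; v_max truncates the range."""
    s = math.sqrt(max(1.0 - R * R, 1e-300))  # guard the limits R = +1, -1
    cut = min(a, v_max)

    def integrate(f, lo, hi):
        if hi <= lo:
            return 0.0
        h = (hi - lo) / steps
        return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))

    outer = integrate(lambda v: gauss(v) * H(-R * v / s), cut, v_max)
    inner = integrate(lambda v: gauss(v) * H(R * v / s), 0.0, cut)
    return 2.0 * (outer + inner)
```

Scanning `generalization_error(R, a)` over $R \in [-1, 1]$ for a few values of $a$ gives curves of the kind described for figure 1.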

In figure 1 we plot $E(R)$ ($= \epsilon_g$) for several values of the parameter $a$. From this
figure, we see that for $a = \infty$ (the learnable limit), $\epsilon_g$ goes to zero when $R$ approaches 1.
In contrast, for $a = 0$, $\epsilon_g$ goes to zero when $R$ reaches $-1$. If $a$ is finite, the generalization
error shows highly non-trivial behaviour. The critical value $R_*$ of the order parameter
is defined as the point where $E(R)$ is locally minimum. Explicitly,

$$R_* = -\sqrt{\frac{2\log 2 - a^2}{2\log 2}} \qquad (2.7)$$

which exists for $a < a_{c1} = \sqrt{2\log 2} = 1.18$. We plot in figure 2 the value of the global
minimum of $E(R)$, the smallest possible generalization error irrespective of learning
algorithms. In figure 3, we show the value of $R$ which gives the global minimum. We
notice that for $a < a_{c2} = 0.80$, $E_{\rm local} = E(R = R_*)$ is also the global minimum, and for
$a > a_{c2}$, the global minimum is $E(R = 1)$. In the latter case the optimal generalization error is
obtained by training the student weight vector $\mathbf{J}$ so that $R$ goes to 1 (or $\mathbf{J} = \mathbf{J}^0$). This
critical value $a_{c2}$ is given by the condition $E(R = 1) = E_{\rm local}$.

On the other hand, for $a < a_{c2}$, the optimal generalization cannot be achieved
even if the student succeeds in finding $\mathbf{J}^0$ completely. In this curious case, the optimal
generalization is obtained by training the student so that the student finds a weight
vector which satisfies $R = R_*$ instead of $R = 1$. At $a = a_{c2}$ the generalization error has
the maximum value as seen in figure 2.
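
The critical overlap (2.7) and the threshold $a_{c1}$ can be tabulated directly. In the sketch below, `stationarity_residual` encodes the factor $1 - 2\exp[-a^2/(2(1-R^2))]$, which is what one obtains here by differentiating (2.6) with respect to $R$; that step is our own working and should be read as an assumption, as should the helper names.

```python
import math

LOG2 = math.log(2.0)
A_C1 = math.sqrt(2.0 * LOG2)  # threshold a_c1, about 1.18

def r_star(a):
    """Local minimum R_* of E(R) from equation (2.7); None when a >= a_c1."""
    if a >= A_C1:
        return None
    return -math.sqrt((2.0 * LOG2 - a * a) / (2.0 * LOG2))

def stationarity_residual(R, a):
    """Assumed stationarity factor of E(R); it vanishes at R = R_*."""
    return 1.0 - 2.0 * math.exp(-a * a / (2.0 * (1.0 - R * R)))
```

For example, `r_star(0.5)` is close to $-0.905$, while `r_star(1.5)` returns `None` because no local minimum exists above $a_{c1}$.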



3. Dynamics of noiseless learning 

We now investigate the learning dynamics with specific learning rules. 



3.1. Perceptron learning 



We first investigate the perceptron learning 

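$$\mathbf{J}^{m+1} = \mathbf{J}^m - \Theta(-T_a(v)S(u))\,\mathrm{sign}(u)\,\mathbf{x} \qquad (3.1)$$

where $\Theta$ is the step function and $m$ stands for the discrete time step of the dynamics or the
number of presented examples. The standard procedure (see e.g. [12]) yields the rate
of change of $l$ and $R$ in the limit $N \to \infty$ as

$$\frac{dl}{d\alpha} = \frac{1}{l}\left[\frac{E(R)}{2} - F(R)\,l\right] \qquad (3.2)$$

$$\frac{dR}{d\alpha} = \frac{1}{l^2}\left[-\frac{R}{2}E(R) + \left(F(R)R - G(R)\right)l\right] \qquad (3.3)$$

where $E(R) = \langle\langle 1 \rangle\rangle_R$, $F(R) = \langle\langle u\,\mathrm{sign}(u) \rangle\rangle_R$ and $G(R) = \langle\langle v\,\mathrm{sign}(u) \rangle\rangle_R$. The
brackets $\langle\langle \cdots \rangle\rangle_R$ stand for the averaging with respect to the distribution $P_R(u,v)$, the
integration being carried out over the region where the student and the teacher give
different outputs, $T_a(v) \neq S(u)$. Hence the definition of $E(R)$ coincides with that of the
generalization error, $E(R) = \epsilon_g$, as used in the previous section. The other quantities
$F(R)$ and $G(R)$ are evaluated in a straightforward manner as

$$F(R) = -\frac{R}{\sqrt{2\pi}}(1-2\Delta) + \frac{1}{\sqrt{2\pi}} \qquad (3.4)$$

$$G(R) = -\frac{1}{\sqrt{2\pi}}(1-2\Delta) + \frac{R}{\sqrt{2\pi}} \qquad (3.5)$$

where $\Delta = e^{-a^2/2}$.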
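
The flow (3.2)-(3.3) can be coded in a few lines. The following is a minimal sketch, assuming the forms $F(R) = [1 - R(1-2\Delta)]/\sqrt{2\pi}$ and $G(R) = [R - (1-2\Delta)]/\sqrt{2\pi}$ for (3.4) and (3.5) and a midpoint-rule evaluation of $E(R)$; the function names are ours. It also locates the point where both right-hand sides vanish, using $G(R_0) = 0$ and $l_0 = E(R_0)/(2F(R_0))$.

```python
import math

def H(x):
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def E(R, a, v_max=10.0, steps=2000):
    """Generalization error (2.6) by the midpoint rule (numerical assumption)."""
    s = math.sqrt(max(1.0 - R * R, 1e-300))
    cut = min(a, v_max)

    def integral(sign, lo, hi):
        if hi <= lo:
            return 0.0
        h = (hi - lo) / steps
        total = 0.0
        for i in range(steps):
            v = lo + (i + 0.5) * h
            total += math.exp(-0.5 * v * v) * H(sign * R * v / s)
        return h * total / math.sqrt(2.0 * math.pi)

    return 2.0 * (integral(-1.0, cut, v_max) + integral(1.0, 0.0, cut))

def F(R, delta):
    return (1.0 - R * (1.0 - 2.0 * delta)) / math.sqrt(2.0 * math.pi)

def G(R, delta):
    return (R - (1.0 - 2.0 * delta)) / math.sqrt(2.0 * math.pi)

def perceptron_flow(R, l, a):
    """Right-hand sides (dR/dalpha, dl/dalpha) of equations (3.3) and (3.2)."""
    delta = math.exp(-0.5 * a * a)
    e, f, g = E(R, a), F(R, delta), G(R, delta)
    dl = (0.5 * e - f * l) / l
    dR = (-0.5 * R * e + (f * R - g) * l) / (l * l)
    return dR, dl

def fixed_point(a):
    """Point (R0, l0) where both right-hand sides vanish."""
    delta = math.exp(-0.5 * a * a)
    R0 = 1.0 - 2.0 * delta
    l0 = E(R0, a) / (2.0 * F(R0, delta))
    return R0, l0
```

Feeding `perceptron_flow` to any standard ODE stepper (Euler or Runge-Kutta) from a chosen initial $(R, l)$ gives flows of the kind described for figures 4 and 5.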



3.1.1. Numerical analysis of differential equations We have numerically solved
equations (3.2) and (3.3). The resulting flows of $R$ and $l$ are shown in figure 4 for
$a = \infty$ under several initial conditions. This figure indicates that $R$ reaches 1 (perfect
generalization state) in the limit of $\alpha \to \infty$ and $l \to \infty$ for any initial condition. For
finite $\alpha$, however, the behaviour of the flow strongly depends on the initial condition. If we
take a large $l$ as the initial value, the perfect generalization state ($R = 1$) is achieved
after $l$ decreases at intermediate steps. If we choose an initial $R$ close to 1 and a small $l$,
perfect generalization is achieved after a decrease of $R$ is observed. Similar phenomena
have been reported in the $K = 2$ parity machine [12]. Next we display the flows of $R$
and $l$ for unlearnable cases, for example $a = 2.0$, in figure 5. There exists a stable and
$a$-dependent fixed point $(R_0, l_0)$. The generalization of the student halts at this fixed
point even if the flow of $R$ and $l$ starts from $R = 1$ and large $l$.



3.1.2. Asymptotic analysis of the learning curve When the rule is learnable ($a = \infty$),
it is straightforward to check the asymptotic behaviour $\epsilon_g = k\alpha^{-1/3}$, $k = \sqrt{2}\,(3\sqrt{2})^{-1/3}/\pi$,
from equations (3.2) and (3.3). When $a$ is finite, the fixed point value
of $R$ is obtained from equations (3.2)-(3.5) as $R_0 = 1 - 2\Delta$. Substituting this $R_0$ into
$E(R)$, we get the minimum value of the generalization error $E_0 = \epsilon_{\min}(a)$ for perceptron
learning. In figures 2 and 3, we show $R_0$ and $E_0$ as functions of $a$. Figure 2 indicates
that the learning for $a = a_{c1} = \sqrt{2\log 2}$, which is obtained from the condition $R_0 = 0$, is
equivalent to a random guess, $\epsilon_{\min}(a_{c1}) = 0.5$.

Linearization of the right-hand side of equations (3.2) and (3.3) around the fixed 
point yields the behaviour of the generalization error near the fixed point. Explicit 
expressions simplify when a is large: it turns out that the generalization error decays 
toward the minimum value 



E(R) ~ 2H(a) 




exponentially as (\/2/it) exp(— 2A 2 ^ 3 a/n). 
3.2. Hebbian learning 

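In the Hebbian rule the dynamics of the student weight vector is

$$\mathbf{J}^{m+1} = \mathbf{J}^m + T_a(v)\,\mathbf{x}. \qquad (3.7)$$

This recursion relation of the $N$-dimensional vector $\mathbf{J}$ is reduced to the evolution
equations of the order parameters as

$$\frac{dl}{d\alpha} = \frac{1}{l}\left[\frac{1}{2} + \frac{2R}{\sqrt{2\pi}}(1-2\Delta)\,l\right] \qquad (3.8)$$

$$\frac{dR}{d\alpha} = \frac{1}{l^2}\left[-\frac{R}{2} + \frac{2}{\sqrt{2\pi}}(1-2\Delta)(1-R^2)\,l\right]. \qquad (3.9)$$
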



3.2.1. Numerical analysis of differential equations In figure 6, we plot the flows in
the $R$-$l$ plane and the generalization error for $a = \infty$, $2.0$ and $0.5$. We started the
dynamics with the initial condition $(R_{\rm init}, l_{\rm init}) = (0.01, 0.1)$. This figure shows that $R$
reaches 1 for large $a$ and $R$ approaches $-1$ for small $a$. In order to find this bifurcation
point near $R = 0$, we approximate equation (3.9) around $R \simeq 0$ as

$$\frac{dR}{d\alpha} \simeq \frac{2}{\sqrt{2\pi}\,l}(1-2\Delta). \qquad (3.10)$$

If $a > a_{c1} = \sqrt{2\log 2} = 1.18$, the derivative $dR/d\alpha$ is positive, and consequently $R$
increases and eventually reaches 1 in the limit $\alpha \to \infty$. If $a < a_{c1}$, $R$ reaches $-1$ as $\alpha \to \infty$.
Figure 7 shows how the generalization error behaves according to $a$. For $a = 0.5$ ($< a_{c1}$),
$\epsilon_g$ has a minimum at some intermediate $\alpha$. When the generalization error $\epsilon_g$ passes
through this value, $\epsilon_g$ begins to increase toward the limiting value $\epsilon_{\min}(a) = 1 - 2H(a)$.
Therefore, if the student learns excessively, he cannot achieve the lowest generalization
error located at the global minimum of $E(R) = \epsilon_g$ (over-training) JTJ|. |3j.

From figure 1 we see that $R$ must pass through a local minimum of $E(R)$ at $R = R_*$
in order to go to the state $R = -1$. If the parameter $a$ satisfies $a < a_{c2} = 0.80$, this local
minimum is also the global minimum. Therefore, if $a < a_{c1}$, although the generalization
error decreases until $R$ reaches $R_*$, it begins to increase as soon as $R$ passes through the
minimum point $R = R_*$ and finally reaches a larger value at $R = -1$.

When the parameter $a$ lies in the range $a_{c2} < a < a_{c1}$, the global minimum is
located at $R = 1$. However, since $R$ goes to $-1$ for $a < a_{c1}$ (see equation (3.10)), the
generalization error increases monotonically from 0.5 (random guess) to $1 - 2H(a)$ ($> 0.5$)
for the parameter range $a_{c2} < a < a_{c1}$. We can regard this as a special case of
over-training. We conclude that over-training appears for all $a < a_{c1}$.
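
The sign change of the Hebbian drift at $a_{c1}$ can be seen both in closed form and in a direct simulation. The sketch below assumes the Hebbian flow $dl/d\alpha = (1/l)[1/2 + B R l]$, $dR/d\alpha = (1/l^2)[-R/2 + B(1-R^2)l]$ with $B = \sqrt{2/\pi}\,(1-2\Delta)$, which integrates exactly through the auxiliary variable $\rho = Rl$ (introduced here for convenience). The simulation uses a finite $N$, $\alpha = m/N$ and a student starting from $\mathbf{J} = 0$; these choices and all function names are assumptions made for illustration.

```python
import math
import random

def teacher_output(v, a):
    """Reversed-wedge teacher T_a(v) = sign[v (a - v)(a + v)]."""
    return 1 if v * (a - v) * (a + v) > 0 else -1

def hebbian_overlap(alpha, a, R_init=0.0, l_init=0.0):
    """Closed-form R(alpha) for the Hebbian flow, via rho = R * l."""
    delta = math.exp(-0.5 * a * a)
    B = math.sqrt(2.0 / math.pi) * (1.0 - 2.0 * delta)  # average of T_a(v) * v
    rho0 = R_init * l_init
    rho = rho0 + B * alpha                       # d(rho)/d(alpha) = B
    # d(l^2)/d(alpha) = 1 + 2 * B * rho, integrated from alpha = 0
    l_sq = l_init ** 2 + (1.0 + 2.0 * B * rho0) * alpha + (B * alpha) ** 2
    return rho / math.sqrt(l_sq) if l_sq > 0.0 else R_init

def simulate_hebbian(n, alpha, a, seed=0):
    """Finite-N on-line Hebbian learning, J -> J + T_a(v) x, starting at J = 0."""
    rng = random.Random(seed)
    J0 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm0 = math.sqrt(sum(w * w for w in J0))
    J = [0.0] * n
    for _ in range(int(alpha * n)):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm_x = math.sqrt(sum(c * c for c in x))
        x = [c / norm_x for c in x]              # input on the unit sphere
        v = math.sqrt(n) * sum(w * c for w, c in zip(J0, x)) / norm0
        t = teacher_output(v, a)
        J = [w + t * c for w, c in zip(J, x)]
    norm = math.sqrt(sum(w * w for w in J))
    return sum(w * w0 for w, w0 in zip(J, J0)) / (norm * norm0)
```

With `a = 0.5` the overlap drifts to negative values and approaches $-1$, so the student necessarily passes through $R_*$ on the way, which is the over-training scenario discussed above.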

3.2.2. Asymptotic analysis of the learning curve With the same technique as in the
previous section, we obtain the asymptotic form of the generalization error when $a = \infty$
in the limit $\alpha \to \infty$ as

$$\epsilon_g = \frac{1}{\sqrt{2\pi}}\frac{1}{\sqrt{\alpha}} \qquad (3.11)$$

which is a well-known result [|H

For finite $a$ satisfying $a > a_{c1}$, simple manipulations as before show that the stable
fixed point is at $R = 1$ and the differential equations (3.8) and (3.9) yield the asymptotic
form of the generalization error as

$$\epsilon_g = \frac{1}{\sqrt{2\pi}\,(1-2\Delta)}\frac{1}{\sqrt{\alpha}} + 2H(a). \qquad (3.12)$$

The limiting value $2H(a)$ is the best possible value obtained in section 2. On the other
hand, for $a < a_{c1}$,

$$\epsilon_g = \frac{1}{\sqrt{2\pi}\,(1-2\Delta)}\frac{1}{\sqrt{\alpha}} + 1 - 2H(a). \qquad (3.13)$$

The rate of approach to the asymptotic value, $1/\sqrt{\alpha}$, in equations (3.12) and (3.13)
agrees with the corresponding behaviour in the Gibbs learning of unlearnable rules {§].



4. Learning under output noise in the teacher signal 

We now consider the situation where the output of the teacher is inverted randomly
with a rate $\lambda$ ($< 1/2$) for each example. We show that the parameter $a$ plays essentially
the same role as output noise in the teacher signal.



4.1. Perceptron learning



According to references [12], [15], [16], the effect of output noise is taken into account in the
differential equations (3.2) and (3.3) by replacing $E(R)$, $F(R)$ and $G(R)$ with $E_\lambda(R)$,
$F_\lambda(R)$ and $G_\lambda(R)$ as follows

$$E_\lambda(R) = (1-\lambda)E(R) + \lambda E_c(R)$$

$$F_\lambda(R) = (1-\lambda)F(R) + \lambda F_c(R) \qquad (4.1)$$

$$G_\lambda(R) = (1-\lambda)G(R) + \lambda G_c(R)$$

where $E_c$, $F_c$ and $G_c$ correspond to $E$, $F$ and $G$, the only difference being that the
integration is over the region satisfying $T_a(v) = S(u)$.

We study the asymptotic behaviour of the learning curve in the limit of small noise
level $\lambda \ll 1$. For the learnable case $a = \infty$, equations (3.2) and (3.3) with (4.1) taken
into account have the fixed point at $R = R_0 = 1 - 2\lambda$, $l = l_0 = (2\sqrt{2\pi\lambda})^{-1}$ for $\lambda \ll 1$.
Linearization around this fixed point leads to the asymptotic behaviour

$$l \simeq l_0\left[1 + O(e^{-8\lambda^{3/2}\alpha})\right]$$

$$1 - R \simeq (1 - R_0)\left[1 + O(e^{-8\lambda^{3/2}\alpha})\right]. \qquad (4.2)$$

Therefore, the generalization error $\epsilon_g$ converges to a finite value $E(R = 1 - 2\lambda) = 2\lambda^{1/2}/\pi$
exponentially, as $\exp(-8\lambda^{3/2}\alpha)$.

According to Biehl et al [16], it is useful to distinguish two performance measures
of on-line learning, the generalization error $\epsilon_g$ and the prediction error $\epsilon_p$. The
generalization error $\epsilon_g$ is the probability of disagreement between the student and the
genuine rule of the teacher, as we have discussed. On the other hand, the prediction error
$\epsilon_p$ is the probability of disagreement between the student and the noisy teacher output
for an arbitrary input. In the present case, the prediction error $\epsilon_p$ and generalization
error $\epsilon_g$ satisfy the relation

$$\epsilon_p = \lambda + (1-2\lambda)\epsilon_g. \qquad (4.3)$$

For the unlearnable case of large but finite $a$ under small noise level, the fixed point
value of $R$ is found to be $R_0(\lambda) = (1-2\Delta)(1-2\lambda)$. The expression of the fixed point
$l_0(\lambda)$ is too complicated and is omitted here. Linearization near this fixed point shows
that the generalization error converges to $(2/\pi)\lambda^{1/2} + 2H(a)$ exponentially as $\exp(-t_-\alpha)$
for large $a$ and small $\lambda$, where

(_ 8A 3/2 _ 2A x /2) _ ,/(-8A 3 /2 + 2AV2)2 _ (8A + 4A- X A 2 ) 
t_ = v - . (4.4) 

The prediction error is given by $\epsilon_p = \lambda + (1-2\lambda)\epsilon_g$.



4.2. Hebbian learning

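The differential equations of the order parameters for noisy Hebbian learning are

$$\frac{dl}{d\alpha} = \frac{1}{l}\left[\frac{1}{2} + \frac{2R}{\sqrt{2\pi}}(1-2\Delta)(1-2\lambda)\,l\right] \qquad (4.5)$$

$$\frac{dR}{d\alpha} = \frac{1}{l^2}\left[-\frac{R}{2} + \frac{2}{\sqrt{2\pi}}(1-2\Delta)(1-2\lambda)(1-R^2)\,l\right]. \qquad (4.6)$$
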



We plot the generalization error for $a = 0.5$ in figure 8 by solving these differential
equations numerically. We saw in the previous section that over-training appears in
the absence of noise if $a < a_{c1} = \sqrt{2\log 2}$, which is also the case when there is small noise
(e.g. $\lambda = 0.01$). For larger $\lambda$ (e.g. $\lambda = 0.20$), however, there appears no minimum in $\epsilon_g$
as $\alpha$ increases. This implies in terms of figure 1 that $R$ becomes stuck at an intermediate
$R$ before it reaches $R_*$.

The asymptotic form for the noisy case can be derived simply by replacing $(1-2\Delta)$
in the asymptotic form of the noiseless case with $(1-2\Delta)(1-2\lambda)$. Thus $\Delta = e^{-a^2/2}$
and $\lambda$ have the same effect on the asymptotic generalization ability. A similar effect
is reported for the non-monotonic Hopfield model || [| which works as an associative
memory. If we embed patterns by the Hebb rule in the network, the capacity of the
network drastically deteriorates for small $a$.
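
The equivalence between the wedge parameter and output noise can be made explicit with a small helper. In this sketch, `hebbian_drive` returns the coefficient $\sqrt{2/\pi}\,(1-2\Delta)(1-2\lambda)$ that multiplies $l$ in the noisy Hebbian flow, and `equivalent_noise` returns the flip rate $\lambda = \Delta$ that gives a learnable teacher ($a = \infty$) the same coefficient as a noiseless teacher of width $a$. Both names are illustrative, and the mapping is only meaningful as a noise rate when $\Delta < 1/2$, that is for $a > a_{c1}$.

```python
import math

def hebbian_drive(a, lam):
    """Coefficient sqrt(2/pi) * (1 - 2*Delta) * (1 - 2*lambda) of the noisy flow."""
    delta = math.exp(-0.5 * a * a)
    return math.sqrt(2.0 / math.pi) * (1.0 - 2.0 * delta) * (1.0 - 2.0 * lam)

def equivalent_noise(a):
    """Flip rate lambda = Delta that mimics a wedge of width a (a > a_c1)."""
    return math.exp(-0.5 * a * a)
```

For instance, a noiseless teacher with $a = 2$ drives the Hebbian student exactly like a learnable teacher whose outputs are flipped with probability $e^{-2} \approx 0.135$.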



5. Optimization of learning rate 

We have so far investigated the learning processes with a fixed learning rate. In 
this section we consider optimization of the learning rate to improve the learning 
performance. It turns out that the perceptron learning with optimized learning rate 
achieves the best possible generalization error in the range $a > a_{c2}$.



We first introduce the learning rate $g(\alpha)$ in our dynamics. As an example, the
learning dynamics for the perceptron algorithm is written as

$$\mathbf{J}^{m+1} = \mathbf{J}^m - g(\alpha)\,\Theta(-T_a(v)S(u))\,\mathrm{sign}(u)\,\mathbf{x}. \qquad (5.1)$$

This optimization procedure is different from the technique of Kinouchi and Caticha
[17]. They investigated the on-line dynamics with a general weight function $f(T_a(v),u)$
as

$$\mathbf{J}^{m+1} = \mathbf{J}^m + f(T_a(v),u)\,T_a(v)\,\mathbf{x} \qquad (5.2)$$

and chose $f(T_a, u)$ so that it maximizes the increase of $R$ per learning step. In contrast,
our optimization procedure adjusts the parameter $g(\alpha)$ keeping the learning algorithm
unchanged.

5.1. Perceptron learning 

5.1.1. Trajectory in the R-l plane The trajectories in the $R$-$l$ plane can be derived
explicitly for the optimal learning rate $g_{\rm opt}(\alpha)$. The differential equations with the
learning rate $g(\alpha)$ are

$$\frac{dl}{d\alpha} = \frac{g(\alpha)^2 E(R)/2 - g(\alpha)F(R)\,l}{l} \qquad (5.3)$$

$$\frac{dR}{d\alpha} = \frac{-R\,E(R)\,g(\alpha)^2/2 + g(\alpha)\left[F(R)R - G(R)\right]l}{l^2} \equiv L(g(\alpha)). \qquad (5.4)$$

Now we choose the parameter $g$ to maximize $L(g(\alpha))$ with the aim to accelerate the
increase of $R$:

$$g_{\rm opt}(\alpha) = \frac{\left[F(R)R - G(R)\right]l}{R\,E(R)}. \qquad (5.5)$$

Substituting this $g$ into equations (5.3) and (5.4) and taking their ratio, we find

$$\frac{dR}{dl} = -\frac{\left[F(R)R - G(R)\right]R}{\left[F(R)R + G(R)\right]l}. \qquad (5.6)$$

Using equations (3.4) and (3.5) we obtain the trajectory in the $R$-$l$ plane as

$$(1+R)^{-(1+A)/A}\,(1-R)^{(1-A)/A}\,R = c\,l \qquad (5.7)$$

where $A = 1 - 2\Delta$ and $c$ is a constant.
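
A quick numerical consistency check of the trajectory (5.7) is possible without solving any differential equation, because $E(R)$ cancels in the ratio of (5.3) and (5.4). The sketch below assumes $F(R) = (1 - AR)/\sqrt{2\pi}$, $G(R) = (R - A)/\sqrt{2\pi}$ and the ratio $d\ln l/dR = -(FR + G)/[(FR - G)R]$ obtained with the optimal rate (5.5); it compares that slope with a finite-difference derivative of $\ln l$ taken from (5.7). Function names are ours.

```python
import math

SQRT_2PI = math.sqrt(2.0 * math.pi)

def F(R, A):
    """F(R) of equation (3.4), written with A = 1 - 2*Delta."""
    return (1.0 - A * R) / SQRT_2PI

def G(R, A):
    """G(R) of equation (3.5), written with A = 1 - 2*Delta."""
    return (R - A) / SQRT_2PI

def log_l_on_trajectory(R, A, log_c=0.0):
    """ln l from (5.7): c l = R (1+R)^(-(1+A)/A) (1-R)^((1-A)/A)."""
    return (math.log(R) - (1.0 + A) / A * math.log(1.0 + R)
            + (1.0 - A) / A * math.log(1.0 - R) - log_c)

def dlogl_dR_from_flow(R, A):
    """d(ln l)/dR from the ratio of (5.3) to (5.4) at the optimal rate."""
    f, g = F(R, A), G(R, A)
    return -(f * R + g) / ((f * R - g) * R)
```

The two slopes agree for any $0 < R < 1$ and any $A \neq 0$, which is a useful sanity check when re-deriving (5.7) by hand.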

In figures 9 and 10, we plot the above trajectory for $a = 2.0$ and $0.5$, respectively,
by adjusting $c$ to reproduce the initial conditions $(R_{\rm init}, l_{\rm init}) = (0.01, 0.10)$, $(0.01, 1.00)$
and $(0.01, 2.00)$. These figures indicate that the student goes to the state of $R = 1$ after
infinite learning steps ($\alpha \to \infty$) for any initial condition. The final value of $l$ depends on
$a$. If $a$ is small (e.g., 0.5), $l$ increases indefinitely as $\alpha \to \infty$. On the other hand, for
larger $a$, $l$ is seen to decrease as $\alpha$ goes to $\infty$. We investigate this $a$-dependence of $l$ in
more detail in the next subsection.

We plot the corresponding generalization error in figures 11 and 12. We see that for
$a = 2.0$, the generalization ability is improved significantly. However, for $a = 0.5$, the
generalization ability becomes worse than that for $g = 1$ (the unoptimized case).

We note that the above optimal learning rate $g_{\rm opt}(\alpha)$ contains the parameter $a$
unknown to the student. Thus this choice of g(a) is not perfectly consistent with the 
principles of supervised learning. We will propose an improvement on this point in 
section 6 using a parameter-free learning rate. For the moment, we may take the result 
of the present section as a theoretical estimate of the best possible optimization result. 



5.1.2. Asymptotic analysis of the learning curve Let us first investigate the learnable
case. The asymptotic forms of $R$, $l$, $\epsilon_g$ and $g$ as $R \to 1$ are obtained from the same
analysis as in the previous section as $R = 1 - 8/\alpha^2$, $l = c\,e^{-16/\alpha^2}$ and

$$\epsilon_g = \frac{4}{\pi\alpha} \qquad (5.8)$$

$$g(\alpha) = 2\sqrt{2\pi}\,\frac{c\,e^{-16/\alpha^2}}{\alpha} \simeq \frac{2c\sqrt{2\pi}}{\alpha} \qquad (5.9)$$

where $c$ is a constant depending on the initial condition. The decay rate to vanishing
generalization error is improved from $\alpha^{-1/3}$ for the unoptimized case [15] to $\alpha^{-1}$. This
$\alpha^{-1}$-law is the same as in the off-line (or batch) learning |Tj|. We also see that $l$
approaches $c$ as $R$ reaches 1.

We next investigate the unlearnable case $\Delta \neq 0$. The asymptotic forms are

$$R = 1 - \frac{2\pi H(a)}{(1-2\Delta)^2}\frac{1}{\alpha} \qquad (5.10)$$

$$l = c\,\alpha^{-2\Delta/(1-2\Delta)}$$

$$\epsilon_g = \frac{\sqrt{2}\sqrt{2\pi H(a)}}{\pi(1-2\Delta)}\frac{1}{\sqrt{\alpha}} + 2H(a) \qquad (5.11)$$

and the optimal learning rate $g_{\rm opt}$ is

$$g_{\rm opt}(\alpha) \simeq c\,\frac{\sqrt{2\pi}}{1-2\Delta}\,\frac{\alpha^{-2\Delta/(1-2\Delta)}}{\alpha}. \qquad (5.12)$$

From the asymptotic form of $l$, we find that $l$ diverges with $\alpha$ for $a < a_{c1} = \sqrt{2\log 2}$ and
goes to zero for $a > a_{c1}$ as observed in the previous subsection. It is interesting that,
for $a$ exactly equal to $a_{c1}$, $g_{\rm opt}$ vanishes and the present type of optimization does not
make sense.

For $a > a_{c2} = 0.80$, the generalization error converges to the optimal value $2H(a)$
as $\alpha^{-1/2}$. This is the same exponent as that of the Hebbian learning as we saw in the
previous section. For $a < a_{c2}$, in order to get the optimal overlap $R = R_*$, we must
stop the on-line dynamics before the system reaches the state $R = -1$. Accordingly,
the method discussed in this section is not useful for the purpose of improvement of
generalization ability for $a < a_{c2}$.

5.2. Hebbian learning 

The Hebbian learning with learning rate $g(\alpha)$ is

$$\mathbf{J}^{m+1} = \mathbf{J}^m + g(\alpha)\,T_a(v)\,\mathbf{x}. \qquad (5.13)$$

Using the same technique as in the previous subsection, we find the optimal learning
rate for the Hebbian learning $g^h_{\rm opt}(\alpha)$ as

$$g^h_{\rm opt} = \sqrt{\frac{2}{\pi}}\,\frac{(1-2\Delta)(1-R^2)\,l}{R}. \qquad (5.14)$$

The $R$-$l$ trajectory is

$$\frac{R}{1-R^2} = c\,l \qquad (5.15)$$

where $c$ is a constant determined by the initial condition. It is very interesting that this
trajectory is independent of $a$.

The asymptotic forms of various quantities for $a > a_{c1}$ of the Hebbian learning are

$$R = 1 - \frac{\pi}{4(1-2\Delta)^2\alpha} \qquad (5.16)$$

$$l = c\,\alpha$$

and

$$\epsilon_g = \frac{1}{\sqrt{2\pi}\,(1-2\Delta)}\frac{1}{\sqrt{\alpha}} + 2H(a) \qquad (5.17)$$

$$g(\alpha) = c. \qquad (5.18)$$

Accordingly, for $a > a_{c1}$, the asymptotic form of the generalization error is the same
as for $g = 1$. However, in the parameter region $a < a_{c1}$, the generalization ability
deteriorates by introducing the optimal learning rate if we select an initial condition
satisfying $R > 0$. To see this, we note that $dR/d\alpha$ is approximated around $R = 0$ as
$dR/d\alpha \simeq 2(1-2\Delta)^2/(\pi R)$ when $g^h_{\rm opt}$ is used. Therefore if we start the learning dynamics
from $R > 0$, the overlap $R$ goes to 1 and the generalization error approaches $2H(a)$
which is not acceptable at all because it exceeds 0.5. On the other hand, for $a < a_{c1}$ and
$R_{\rm init} < 0$, the generalization error approaches $1 - 2H(a)$ (less than 0.5 but not optimal)



as 



e K = = 1 l -= + l-2H(a). (5.19) 

Thus an over-training appears. We must notice that the prefactor of the generalization 
error changes from 1 / \/Qtx in equation (3.13) to 1/ \/2~7r in equation ( |5.19| ) by introducing 



the optimal learning rate. Therefore the optimization by using the learning rate g(a) is 
not very useful for the Hebbian learning. 

6. Optimal learning without unknown parameters 

As we mentioned in section 5, the generalization error obtained there is the theoretical 
(not practical) lower bound because the optimal learning rate $g_{\rm opt}$ contains a parameter
a unknown to the student. In this section we propose a method to avoid this difficulty 
for the perceptron learning algorithm. 

For the learnable case we choose the learning rate $g$ as

$$g = \frac{2\sqrt{2\pi}}{\alpha}\,l \qquad (6.1)$$

which is nothing but the asymptotic form (5.9) of the previous optimized learning rate.
Substituting this into equation (5.4) with (5.5), we find $R = 1 - 8/\alpha^2$ when $R$ is close
to unity and correspondingly

$$\epsilon_g = \frac{4}{\pi\alpha} \qquad (6.2)$$

which agrees with the result of Barkai et al [TEH .

For the unlearnable case, we assume $g(\alpha) = kl/\alpha$ as before and find the general
solution for $R = 1 - \varepsilon$ as

$$\varepsilon = \frac{k^2 H(a)}{bk - 1}\frac{1}{\alpha} + c\left(\frac{1}{\alpha}\right)^{bk} \qquad (6.3)$$

where $b = \sqrt{2/\pi}\,(1-2\Delta)$. The first term dominates asymptotically if $bk > 1$. In this
case, we have

$$\epsilon_g = 2H(a) + \sqrt{\frac{2k^2H(a)}{bk-1}}\,\frac{1}{\pi\sqrt{\alpha}}. \qquad (6.4)$$

The second term on the right-hand side is minimized by choosing

$$k = \frac{\sqrt{2\pi}}{1-2\Delta} \qquad (6.5)$$

which satisfies $bk > 1$ as required. Equation (6.4) makes sense for $a > \sqrt{2\log 2}$ if $k$ is
chosen as above.
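
The choice of $k$ can be checked numerically. The sketch below assumes the coefficient $k^2H(a)/(bk-1)$ of the $1/\alpha$ term with $b = \sqrt{2/\pi}\,(1-2\Delta)$, and the minimizer $k = 2/b = \sqrt{2\pi}/(1-2\Delta)$; the function names are illustrative.

```python
import math

def H(x):
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def b_coeff(a):
    """b = sqrt(2/pi) * (1 - 2*Delta) with Delta = exp(-a^2/2)."""
    return math.sqrt(2.0 / math.pi) * (1.0 - 2.0 * math.exp(-0.5 * a * a))

def prefactor(k, a):
    """Coefficient k^2 H(a) / (b k - 1) of the 1/alpha term; needs b k > 1."""
    return k * k * H(a) / (b_coeff(a) * k - 1.0)

def optimal_k(a):
    """Minimizer k = sqrt(2*pi) / (1 - 2*Delta), valid for a > a_c1."""
    return math.sqrt(2.0 * math.pi) / (1.0 - 2.0 * math.exp(-0.5 * a * a))
```

At the optimum $bk = 2$, so the condition $bk > 1$ holds automatically; any nearby $k$ gives a larger coefficient and hence a slower approach to $2H(a)$.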
When $bk < 1$, the asymptotic form of the generalization error is

$$\epsilon_g = 2H(a) + \frac{\sqrt{2c}}{\pi}\left(\frac{1}{\alpha}\right)^{bk/2}. \qquad (6.6)$$

This formula is valid for $b > 0$, or $a > a_{c1}$. Similar crossover between two types of
asymptotic forms was reported in the problem of one-dimensional decision boundary


7. Hebbian learning with queries 

We have so far assumed that the student is trained using examples drawn from a uniform
distribution on the $N$-dimensional sphere $S^N$. It is known for the learnable case [20] that
selecting training examples out of a limited set sometimes improves the performance of
learning. We therefore investigate in the present section how the method of Kinzel and
Rujan [20] works for an unlearnable rule.



7.1. Learning with queries under fixed learning rate 

The learning dynamics we choose here is nothing but the Hebbian algorithm (3.7). In
section 3, the student was trained by inputs $\mathbf{x}$ uniform on $S^N$. In the present section
we follow reference [20] and use selected inputs which lie on the borderline, $\mathbf{J}\cdot\mathbf{x} = 0$
or $u = 0$, at every dynamical step. The idea behind this choice is that the student is
not confident for inputs just on the decision boundary and thus teacher signals for such
examples should be more useful than generic inputs.

We use the following conditional distribution, instead of $P_R(u,v)$ in equation (2.5),
in order to get the differential equations

$$P_R(v|u=0) = \sqrt{2\pi}\,\delta(u)\,P_R(u,v). \qquad (7.1)$$

Using this distribution, we obtain the following differential equations

$$\frac{dl^2}{d\alpha} = 1 \qquad (7.2)$$

$$\frac{dR}{d\alpha} = \frac{1}{l^2}\left[-\frac{R}{2} + \frac{2}{\sqrt{2\pi}}\sqrt{1-R^2}\left(1 - 2\exp\left(-\frac{a^2}{2(1-R^2)}\right)\right)l\right]. \qquad (7.3)$$




In figure 13, we plot the generalization error for $a = 1.0$ obtained by numerical integration
of the above differential equations. We see that the generalization ability of the student is
improved and the problem of over-training is avoided.

In order to investigate the asymptotic form of the generalization error, we solve
the differential equations in the limit of $\alpha \to \infty$. Equation (7.2) can be solved easily as
$l = \sqrt{\alpha}$. For the learnable case $a \to \infty$, using $R = 1 - \varepsilon$ and $\varepsilon \to 0$, we obtain $\varepsilon = \pi/(16\alpha)$
and the generalization error as

$$\epsilon_g = \frac{1}{2\sqrt{2\pi}}\frac{1}{\sqrt{\alpha}}. \qquad (7.4)$$

The numerical prefactor has been reduced by a half from equation (3.11).
For finite $a$, equation (7.3) has fixed points at $R_0 = \pm 1$ and

$$R^{(\pm)} = \pm\sqrt{\frac{2\log 2 - a^2}{2\log 2}}. \qquad (7.5)$$

The latter fixed point exists only for $a < a_{c1} = \sqrt{2\log 2}$. Thus, if $a > a_{c1}$, $|R|$ eventually
approaches 1, and the exponential term in equation (7.3) can be neglected. This implies
that the asymptotic analysis for the learnable case applies without modification. The
resulting asymptotic form of the generalization error is

$$\epsilon_g = \frac{1}{2\sqrt{2\pi}}\frac{1}{\sqrt{\alpha}} + 2H(a). \qquad (7.6)$$

If $a < a_{c1}$, the system is attracted to the fixed point $R^{(-)}$ according to the expansion
of the right-hand side of equation (7.3) around $R = 0$,

$$\frac{dR}{d\alpha} \simeq \frac{2}{\sqrt{2\pi}\,l}(1-2\Delta) \qquad (7.7)$$

which is negative if $a < a_{c1}$. It is remarkable that $R^{(-)}$ coincides with $R_*$ which gives
the global minimum of $E(R)$ for $a < a_{c2} = 0.80$. Therefore, for $a < a_{c2}$, the present
Hebbian learning with queries achieves the best possible generalization error. In the
range $a_{c2} < a < a_{c1}$, $R = R^{(-)} = R_*$ is not the global minimum of $E(R)$ but is only
a local minimum. However, as seen in figure 13, over-training has disappeared in this
region by introducing queries.

The asymptotic behaviour for a < a c \ is found to be 

71 



161og2v/2Tog2" 



-opt 



<5(2, -log2) 



xexp 



81og2 

a/™ 



21og2 - a 2 x/a 



where Q(x, y) is the incomplete gamma function and the asymptotic value e op t 
is optimal for a < a cl . 



(7.8) 
E(R*) 



7.2. Optimized Hebbian learning with queries 

We next introduce the parameter $g$ into the Hebbian learning with queries and optimize
$g$ so that $R$ goes to 1 as quickly as possible. As discussed in section 5, this strategy
works only for $a > a_{c2}$ since $R = 1$ is not the optimal value if $a < a_{c2}$. Using the same
technique as section 5, we find the optimal learning rate as

$$g_{\rm opt} = \frac{2}{\sqrt{2\pi}}\,\frac{\sqrt{1-R^2}}{R}\,l\left[1 - 2\exp\left(-\frac{a^2}{2(1-R^2)}\right)\right]. \qquad (7.9)$$

For the learnable case, the solution for $R$ is

$$R = \sqrt{1 - c\,\exp\left(-\frac{2\alpha}{\pi}\right)} \qquad (7.10)$$

where $c$ is a constant. The generalization error decays to zero as

$$\epsilon_g = \frac{\sqrt{c}}{\pi}\exp\left(-\frac{\alpha}{\pi}\right) \qquad (7.11)$$

where $c$ is determined by the initial condition. This exponential decrease for the
learnable case is in agreement with reference [17] where the optimization of the type
of equation (5.2) was used together with queries. The asymptotic forms of the order
parameter $l$ and optimal learning rate $g_{\rm opt}$ are

$$l = c'\sqrt{1 - c\,\exp\left(-\frac{2\alpha}{\pi}\right)} \qquad (7.12)$$

$$g_{\rm opt}(\alpha) = c'\sqrt{\frac{2c}{\pi}}\exp\left(-\frac{\alpha}{\pi}\right) \qquad (7.13)$$

where $c'$ is determined by the initial condition.

Next we investigate the case of finite $a$. Using the same asymptotic analysis as in
the learnable case, we obtain the asymptotic form of generalization error $\epsilon_g$ as

$$\epsilon_g = 2H(a) + \frac{\sqrt{c}}{\pi}\exp\left(-\frac{\alpha}{\pi}\right). \qquad (7.14)$$

The limiting value $2H(a)$ is the theoretical lower bound for $a > a_{c2} = 0.80$. We therefore
have found a method of optimization to achieve the best possible generalization error
with a very fast, exponential, asymptotic approach for $a > a_{c2}$. The present method
of optimization does not work appropriately for $a < a_{c2}$ because $R = 1$, to which the
present method is designed to force the system, is not the best value of $R$ in this range
of $a$.

It is worth investigating whether the exponent of decay changes or not by using a
parameter-free optimal learning rate as in section 6. If $a > a_{c1}$, there exists only one fixed
point $R = 1$. Therefore, the $a$-dependent term $\exp(-a^2/(2(1-R^2)))$ in equation (7.9) does
not affect the asymptotic analysis. We may therefore conclude that the asymptotic form
of generalization error does not change by optimal learning rate without the unknown
parameter $a$.

8. Avoiding over-training by a weight-decay term

We showed in section 3 that over-training appears for the unlearnable case $a < a_{c1}$
in the Hebbian learning. If $a < a_{c1}$, the flow of $R$ goes to $-1$ for any initial condition,
passing through the local minimum of $E(R)$ at $R = R_*$. Consequently, the generalization
ability of the student deteriorates as he learns excessively. In order to avoid this difficulty,
we must stop the dynamics on the way to the state $R = -1$. For this purpose, we may
use the on-line dynamics with a weight-decay term or a forgetting term [TJ| .



The on-line dynamics by the Hebbian rule is modified with the weight-decay term
as

$$\mathbf{J}^{m+1} = \left(1 - \frac{\Lambda}{N}\right)\mathbf{J}^m + T_a(v)\,\mathbf{x}. \qquad (8.1)$$

The fixed point of the above dynamics is

$$R_0 = \frac{2(1-2\Delta)}{\sqrt{\pi\Lambda + 4(1-2\Delta)^2}}. \qquad (8.2)$$

In order to get the optimal value, we choose $R_0$ so that it agrees with $R_*$ which gives
the global minimum of $E(R)$ for $a < a_{c1}$. From this condition, we obtain the optimal
$\Lambda_{\rm opt}$ as

$$\Lambda_{\rm opt} = \frac{4a^2(1-2\Delta)^2}{\pi(2\log 2 - a^2)}. \qquad (8.3)$$

Using this $\Lambda_{\rm opt}$, we solve the differential equations numerically and plot the result
in figure 14 for $a = 0.5$ ($< a_{c1}$). We see that the over-training disappears and the
generalization error converges to the optimal value.
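
The matching condition $R_0 = R_*$ can be verified numerically. This sketch assumes the fixed-point overlap $R_0 = 2(1-2\Delta)/\sqrt{\pi\Lambda + 4(1-2\Delta)^2}$ and the decay constant $\Lambda_{\rm opt} = 4a^2(1-2\Delta)^2/[\pi(2\log 2 - a^2)]$, with $R_*$ taken from (2.7); the function names are illustrative.

```python
import math

LOG2 = math.log(2.0)

def wedge_factor(a):
    """1 - 2*Delta with Delta = exp(-a^2/2)."""
    return 1.0 - 2.0 * math.exp(-0.5 * a * a)

def fixed_point_overlap(decay, a):
    """Overlap R0 reached with weight-decay constant `decay`."""
    A = wedge_factor(a)
    return 2.0 * A / math.sqrt(math.pi * decay + 4.0 * A * A)

def optimal_decay(a):
    """Decay constant for which R0 equals R_*; requires a^2 < 2 log 2."""
    A = wedge_factor(a)
    return 4.0 * a * a * A * A / (math.pi * (2.0 * LOG2 - a * a))

def r_star(a):
    """Local minimum R_* of E(R), equation (2.7)."""
    return -math.sqrt((2.0 * LOG2 - a * a) / (2.0 * LOG2))
```

For $a = 0.5$ this gives a decay constant of roughly $0.16$ and a stationary overlap near $-0.905$, so the student is held at $R_*$ instead of drifting on to $R = -1$.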

We next investigate how fast this convergence is achieved. For this purpose, we 
linearize the differential equations around the fixed point to obtain 



l-R~(l-Ro) {l + O 



( o2/! oA , 2 ^(21og2-a 2 )+4 ^ 
exp(— 2a (1 — 2Aj — — — a 



7r(21og2 - a 



U) 



We warn here that $\Lambda_{\rm opt}$ in equation (8.3) depends on $a$, which is unknown to the student.
Therefore, the result obtained in this section gives the theoretical upper bound of the
generalization ability.



9. Summary and Discussions 

We have analyzed the problem of on-line learning by the perceptron and Hebbian
algorithms. For the unlearnable case, the generalization error decays exponentially
to a finite value $E(R_0)$ with $R_0 = 1 - 2\Delta$ in the case of the perceptron learning. For
the Hebbian learning, the generalization error decays to $2H(a)$, the best possible value,
for $a > a_{c1}$ and to $1 - 2H(a)$ for $a < a_{c1}$, both proportionally to $\alpha^{-1/2}$. In this latter
parameter region $a < a_{c1}$, we observed the phenomenon of over-training.

We also investigated the learning under output noise. For the learnable case of the
perceptron algorithm, the order parameters $R$ and $l$ are attracted toward a fixed point
$(R_0, l_0)$ asymptotically with an exponential law. As a result, the generalization error
decays to a finite value exponentially. On the other hand, for the unlearnable case of
the perceptron learning, the generalization error decays exponentially to a finite value
$E((1 - 2\lambda)(1 - 2\Delta))$. For the Hebbian learning, the generalization error decays to $2H(a)$
in proportion to $1/\sqrt{\alpha}$ for $a > a_{c1}$ and to $1 - 2H(a)$, also in proportion to $1/\sqrt{\alpha}$,
for $a < a_{c1}$.

We introduced the learning rate $g(\alpha)$ in the on-line dynamics and optimized it to
maximize $dR/d\alpha$. By this treatment we obtained a closed-form trajectory of $R$ and $l$.
The generalization ability of the student has been shown to increase for $a > a_{c2} = 0.80$
in the case of the perceptron learning algorithm. For the unlearnable case, the
generalization error decays to the best possible value $2H(a)$ in proportion to $1/\sqrt{\alpha}$.
For the Hebbian learning, the asymptotic generalization ability did not change by this
optimization procedure.

Unfortunately, in the parameter range $a < a_{c2}$, we found it impossible to obtain an
optimal performance for the perceptron learning within our procedure of optimization. 
To overcome this difficulty, we investigated the on-line dynamics with a weight-decay 
term for the Hebbian learning. Using this method, we could eliminate the over-training, 
and the generalization error converges to the optimal value exponentially. 

We also introduced a new learning rate independent of the unknown parameter $a$.
We assumed $g(\alpha) = kl/\alpha$ and optimized $k$ so that the generalization error decays to the
minimum value as quickly as possible. As a result, for the unlearnable case of $a > a_{c1}$,
the prefactor was somewhat improved although the exponent of decay did not change.

The Hebbian learning with queries was also investigated. If the student is trained by
the Hebbian algorithm using inputs on the decision boundary, his generalization ability
is improved except in the range $a_{c2} < a < a_{c1}$. This is a highly non-trivial result because
this choice of query works well for the unlearnable case where the student does not know
the structure of the teacher. We next introduced the optimal learning rate in the
on-line Hebbian learning with queries and obtained very fast convergence of the generalization
error. For $a > a_{c1}$, the generalization error converges to its optimal value exponentially.

We have observed exponential decays to limiting values in various situations of
unlearnable rules. This fast convergence may originate in the large size of the asymptotic
space: if the limiting value of $R$ is unity, only a single point in the $\mathbf{J}$-space, $\mathbf{J} = \mathbf{J}^0$, is the
correct destination of the learning dynamics, which makes it a very difficult task. If, on the other hand, $R$
approaches $R_0$ $(< 1)$, there is a continuum of allowed student vectors, and to
find one of these should be a relatively easy process, leading to exponential convergence.

Figure captions



Figure 1. Generalization error as a function of the overlap $R$ for $a = \infty$, 2.0, 1.0, 0.5
and 0. For $a = \infty$, the generalization error decreases to zero as $R$ goes to 1. For $a = 0$,
the generalization error decays to zero as $R$ goes to $-1$ instead of 1.



Figure 2. The global minimum value of $E(R)$, which corresponds to the optimal value
of the generalization error $\epsilon_{\rm opt}$. We also plotted the generalization error obtained by
perceptron learning with learning rate $g = 1$. When $a = a_{c1}$, the generalization error
under the perceptron algorithm becomes equal to random guess ($\epsilon_g = 0.5$).



Figure 3. The optimal order parameter $R$ which gives the global minimum, namely,
the optimal generalization error $\epsilon_{\rm opt}$. The system shows a discontinuous phase
transition at $a = a_{c2} = 0.80$ from the phase described by $R = 1$ to the phase described
by $R = R_*$. We also plotted $R = 1 - 2\Delta$ obtained by the perceptron learning with
learning rate $g = 1$. When $a = a_{c1}$, the overlap between the teacher and student
vanishes.



Figure 4. Flows of the order parameters $R$ and $l$ for the learnable case ($a = \infty$) by
the perceptron learning. If one starts from large $l$, the student begins to generalize
after the length of the weight vector $l$ decreases to some value.

Figure 5. Flows of the order parameters $R$ and $l$ for the unlearnable case $a = 2.0$ by
the perceptron learning. The flows are attracted to a fixed point.

Figure 6. Flows of $R$ and $l$ for $a = \infty$, 2.0 and 0.5 by the Hebbian learning. For the
cases of $a = \infty$ and 2.0, $R$ reaches 1 and $l$ goes to $\infty$. On the other hand, for $a = 0.5$,
$R$ reaches $-1$ as $l$ goes to $\infty$.



Figure 7. Generalization error $\epsilon_g$ for $a = \infty$, 2.0 and 0.5 by the Hebbian learning.
For $a = \infty$ and 2.0, the generalization error converges to the optimal value $2H(a)$.
However, in the case of $a = 0.5$, the generalization error begins to increase when the
student learns too much (over-training).



Figure 8. Generalization error for the unlearnable case $a = 0.5$ with output noise
$\lambda = 0.01$ and 0.20 by the Hebbian learning.



Figure 9. The trajectories in the $R$-$l$ plane with the optimal learning rate by the
perceptron learning for $a = 2.0$. We choose the initial condition as $(R_{\rm init}, l_{\rm init}) =
(0.01, 0.10)$, $(0.01, 1.00)$ and $(0.01, 2.00)$.



Figure 10. Same as figure 9 with $a = 0.5$.



Figure 11. Generalization error for $a = 2.0$ with the optimal learning rate $g_{\rm opt}$.



Figure 12. Same as figure 11 with $a = 0.5$. If we select a negative value as the initial
condition of $R$ for $a = 0.5$, the generalization error converges to $1 - 2H(a)$ $(> 0.5)$.



Figure 13. Generalization error of the Hebbian learning with queries for a = 1.0. 
Over-training disappeared and the generalization error converges to its optimal value. 



Figure 14. Generalization error of the Hebbian learning with a weight-decay term for 
a = 0.5. Over-training disappeared and generalization error converges to its optimal 
value. 



[Figures 1 to 12 and 14 appear here in the original, each labelled INOUE, NISHIMORI and KABASHIMA. Recoverable plot labels: Figure 1, vertical axis $E(R)$; Figure 3, $R$ plotted against $a$ with $a_{c2}$ and $a_{c1}$ marked, curves "Optimal learning" and "Perceptron learning"; Figure 12, curves $g = 1$, $g = g_{\rm opt}$ and $g = g_{\rm opt}$ with $R_{\rm init} < 0$, plotted against $\alpha$; Figure 14, $\epsilon_g$ plotted against $\alpha$, curves $\Lambda = 0$ and $\Lambda = \Lambda_{\rm opt}$.]